(from de Jonge & van der Loo)
A data set is a collection of data that describes attribute values (variables) of a number of real-world objects (units). By technically correct data we mean a data set where each value:
This means that, for each unit, a text variable should be stored as text, a numeric variable as a number, and so on, all in a format that is consistent across the data set, with appropriate variable (column) names. I will not cover the case in which the variable is categorical, since that would require the use of labels. Further reading on this topic is available at: http://pandas.pydata.org/pandas-docs/stable/categorical.html
The objective is to:
Note that:
We have already seen in previous notes that Pandas provides functions that take a file as input and return a DataFrame as output. This is the most suitable way of reading text into Pandas; however, we can always fall back on Python's csv library and convert the data to a DataFrame using any of the DataFrame constructors.
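For illustration, here is a minimal sketch of that csv-module approach (assuming, as in the cells below, that people.csv has a header row):
In [ ]:
import csv
import pandas as pd

# read the raw rows with Python's csv module
with open("../data/people.csv") as f:
    rows = list(csv.reader(f))
# use the first row as column names and the rest as data
pd.DataFrame(rows[1:], columns=rows[0])
Note that, unlike read_csv, this keeps every value as str, so any type inference must be done by hand.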
We will cover some of the most common uses of the Pandas CSV-reading functions. For other file formats and further options, please refer to http://pandas.pydata.org/pandas-docs/stable/io.html.
Nowadays, the most common data format is CSV: tabular text files that use a separator character to divide the values of each row. There are two main CSV conventions: comma-separated and semicolon-separated files.
In [ ]:
fname = "../data/people.csv"
with open(fname) as f:
    content = f.readlines()
print(content[:5])
Pandas provides the read_csv function, which takes a CSV file as input and returns a DataFrame.
In [1]:
import pandas as pd
from IPython.display import display, HTML
path_to_file = "../data/people.csv"
df = pd.read_csv(path_to_file)
display(df)
print("Types:")
display(df.dtypes)
print(df.describe())
In [2]:
df.columns
Out[2]:
Some things to note:
Useful options of read_csv (a combined example follows this list):
sep : str, defaults to ',' for read_csv(), \t for read_table()
Delimiter to use. If sep is None, will try to automatically determine this. Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions, will force use of the python parsing engine and will ignore quotes in the data. Regex example: '\r\t'.
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is as if header=0 if no names passed, otherwise as if header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of ints that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly pass header=None.
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names).
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (unsupported with engine='python'). Use str or object to preserve and not interpret dtype.
skiprows : list-like or integer, default None
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
skipfooter : int, default 0
Number of lines at bottom of file to skip (unsupported with engine='c').
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files.
na_values : str, list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''.
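As a combined illustration of these options, here is a sketch; the file name, the semicolon separator, the skipped first line, and the "missing" NA marker are all assumptions about a hypothetical file:
In [ ]:
# hypothetical semicolon-separated file: skip a junk first line,
# treat "missing" as NaN, and read only the first 100 rows
df_sketch = pd.read_csv("../data/people_semicolon.csv",
                        sep=";",
                        skiprows=[0],
                        na_values=["missing"],
                        nrows=100)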
Another common file format is XLS, which contains an Excel spreadsheet. In this case, each spreadsheet cell becomes a DataFrame cell, each spreadsheet column a DataFrame column, and the same holds for rows.
In [3]:
import pandas as pd
country_df = pd.read_excel("../data/country_info_worldbank.xls", skiprows=[0, 1, 2], header=1)
country_df.head(8)
Out[3]:
Some things to note:
Useful options of read_excel (an example follows this list):
sheetname : string, int, mixed list of strings/ints, or None, default 0
Strings are used for sheet names, Integers are used in zero-indexed sheet positions.
Lists of strings/integers are used to request multiple sheets.
Specify None to get all sheets.
str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets.
Available Cases
Defaults to 0 -> 1st sheet as a DataFrame
1 -> 2nd sheet as a DataFrame
'Sheet1' -> 1st sheet as a DataFrame
[0, 1, 'Sheet5'] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
None -> All sheets as a dictionary of DataFrames
header : int, list of ints, default 0
Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex
skiprows : list-like
Rows to skip at the beginning (0-indexed)
index_col : int, list of ints, default None
Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex
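For instance, here is a sketch that reads the first two sheets of the worldbank file at once (assuming the file actually contains at least two sheets):
In [ ]:
# a list of sheet positions returns a dict of DataFrames keyed by position
sheets = pd.read_excel("../data/country_info_worldbank.xls",
                       sheetname=[0, 1],
                       skiprows=[0, 1, 2],
                       header=1)
sheets[0].head()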
The use of SQL databases as a data source is very useful; however, it won't be covered in depth in these notes.
The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API.
Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database.
Here are some examples:
http://stackoverflow.com/questions/10065051/python-pandas-and-databases-like-mysql
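And a minimal self-contained sketch, using an in-memory SQLite database through SQLAlchemy so that no external server is needed:
In [ ]:
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # in-memory SQLite database
# write a toy table, then read it back as a DataFrame
pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}).to_sql("demo", engine, index=False)
pd.read_sql("SELECT * FROM demo", engine)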
In previous sessions we saw how to create DataFrames using dicts and lists. This is still a valid option for loading text files as a DataFrame: simply open the file as a regular text file and parse its lines.
To convert the text file to a DataFrame we will follow 5 steps:
We can do this with the common Python approach for reading a text file.
In [17]:
import re
import pandas as pd
fname = "../data/unstructured_data.txt"
with open(fname) as f:
    content = f.readlines()
In [2]:
content
Out[2]:
In [18]:
# keep only the lines that do not start with '#'
data = [l for l in content if re.match("^(?!#).+", l)]
header = data[0].strip("\n")
data = data[1:]
print([header] + data)
In [19]:
# header fields are '/'-separated (dropping the first field); data fields are ','-separated
data_str = [header.split("/")[1:]] + [l.split(",") for l in data]
data_str
Out[19]:
In [20]:
df = pd.DataFrame(data_str[1:], columns=pd.Series(data_str[0]))
df
Out[20]:
We can use to_numeric. This function accepts the argument errors:
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
If 'raise', then invalid parsing will raise an exception
If 'coerce', then invalid parsing will be set as NaN
If 'ignore', then invalid parsing will return the input
In [6]:
pd.to_numeric(df["Age"],errors="ignore")
Out[6]:
Both DataFrames and Series have methods to apply functions over the elements in the structure. These operations take a function (built-in, user-defined, or lambda) and apply it to each element value in the DataFrame or Series, row- or column-wise.
For DataFrames we have:
In [5]:
import numpy as np
df2 = pd.DataFrame(np.random.randn(4, 3),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df2)
f = lambda x: x.max() - x.min()
print(df2.apply(lambda x: x.max() - x.min(),axis=0))
print(df2.apply(f,axis=1))
g = lambda x: x**2
print(df2.applymap(g))
In [6]:
np.random.randn(4, 3)
Out[6]:
In [14]:
def my_func(x):
    return x.mean()
df2.apply(my_func, axis=1)
Out[14]:
For Series we have:
More about applymap, apply and map: http://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas
In [9]:
s = pd.Series(np.arange(5))
print(s)
print(s.map(lambda x: x**2))
print(s.apply(lambda x: x**2))
(from http://pandas.pydata.org/pandas-docs/stable/text.html)
Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array.
These methods exclude missing/NA values automatically and follow (more or less) the same syntax as the built-in string object, under the str namespace. Let's use the DataFrame from the unstructured dataset:
In [21]:
df["Name"] = df["Name"].str.title()
print(df["Name"].str.len())
df
Out[21]:
In [22]:
df.columns
Out[22]:
In [23]:
df.columns.str.lower()
Out[23]:
The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace:
In [24]:
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '], index=range(3))
df
Out[24]:
Since df.columns is an Index object, we can use the .str accessor:
In [25]:
print(df.columns.str.strip())
print(df.columns.str.lower())
These string methods can then be used to clean up the columns as needed. Here we are removing leading and trailing whitespace, lowercasing all names, and replacing any remaining whitespace with underscores:
In [26]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df
Out[26]:
Some things to note:
Notice that if a variable can't be coerced to any other type, Pandas will coerce it to str. It's important to know how to manipulate str values.
str operations can also be performed using apply, as shown below.
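For example, here is upper-casing with the str accessor and with apply; the isinstance guard is needed because apply, unlike the str methods, does not skip missing values:
In [ ]:
s_demo = pd.Series(["ann", "bob", np.nan])
print(s_demo.str.upper())  # NaN is skipped automatically
print(s_demo.apply(lambda x: x.upper() if isinstance(x, str) else x))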
In [27]:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')
Out[27]:
Elements in the split lists can be accessed using get or [] notation:
In [28]:
print(s2.str.split('_').str.get(1))
print(s2.str.split('_').str[1])
It is easy to expand this to return a DataFrame using the expand argument.
In [29]:
s2.str.split('_', expand=True)
Out[29]:
It is also possible to limit the number of splits:
In [30]:
s2.str.split('_', expand=True, n=1)
Out[30]:
rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string:
In [31]:
s2.str.rsplit('_', expand=True, n=1)
Out[31]:
Methods like replace and findall take regular expressions, too:
In [32]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
s3
Out[32]:
In [33]:
s3.str.replace('^.a|dog', 'XX-XX ', case=False)
Out[33]:
We can directly access every character position in a text variable using str indexing with [] notation. If an index is out of bounds for a given string, NaN is returned.
In [34]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
print(s.str[0])
print(s.str[1])
In [35]:
pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)', expand=False)
Out[35]:
Setting the expand argument to False returns a Series when possible. If we set it to True, a DataFrame is always returned. If the regular expression contains multiple groups, a DataFrame is returned in both cases.
In [36]:
pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])', expand=False)
Out[36]:
In [37]:
pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])', expand=True)
Out[37]:
Elements that do not match return a row filled with NaN. Thus, a Series of messy strings can be “converted” into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. The dtype of the result is always object, even if no match is found and the result only contains NaN.
Named groups like
In [38]:
pd.Series(['a1', 'b2', 'c3']).str.extract('(?P<letter>[ab])(?P<digit>\d)', expand=False)
Out[38]:
and optional groups like
In [39]:
pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)', expand=False)
Out[39]:
can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.
To know more about expand attribute (e.g. when used with Index) refer to: http://pandas.pydata.org/pandas-docs/stable/text.html#extracting-substrings
The extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.
In [ ]:
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s.str.extract(two_groups, expand=True)
In [ ]:
s.str.extractall(two_groups)
How do we access a MultiIndex?
You can think of a MultiIndex as a tuple-based index:
In [ ]:
mi_df = s.str.extractall(two_groups)
print(mi_df.columns)
print(mi_df.index)
print(mi_df.loc[("A",0)])
print(mi_df.T["A",0])
We can use the xs accessor with the level parameter to filter sub-index values:
In [ ]:
extract_result = s.str.extract(two_groups, expand=True)
extractall_result = s.str.extractall(two_groups)
print(extractall_result)
print(extractall_result.xs(0, level="match"))
print(extractall_result.xs(1, level="match"))
We can test whether elements contain a pattern:
In [ ]:
pattern = r'[a-z][0-9]'
pd.Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)
or match a pattern:
In [ ]:
pd.Series(['1', '2', '3a', '3b', '03c']).str.match(pattern, as_indexer=True)
Methods like match, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False:
In [ ]:
s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s4.str.contains('A', na=False)
Note that boolean values can be summed, so contains can be used to count matching elements:
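In [ ]:
# True counts as 1 and False as 0, so the sum is the number of matches
s4.str.contains('A', na=False).sum()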
String methods can also help align nearly matching indices. Consider these two DataFrames, where the second index contains typos ('too', 'fours'):
In [ ]:
df1 = pd.DataFrame([[1],[2],[3],[4],[5]],
index=['one','two','three','four','five'],
columns=['number'])
df1
In [ ]:
df2 = pd.DataFrame([['a'],['b'],['c'],['d'],['e']],
index=['one','too','three','fours','five'],
columns=['letter'])
df2
In [ ]:
import difflib
# map each label in df2's index to its closest match in df1's index
df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
df1.join(df2)
As already stated, when it is not possible to convert (coerce) values to a well-defined data type, Pandas falls back to str. We have already worked with strings, and we know how to apply functions element- and column-wise, so it won't be difficult to convert types ourselves.
Pandas provides some functions to convert types:
In [41]:
s = pd.Series(["10","0.9","56","87'6","34'89"])
In [44]:
# we can raise, ignore or coerce
pd.to_numeric(s,errors="coerce")
Out[44]:
In [51]:
pd.to_numeric(s.str.replace("'","."),errors="coerce")
Out[51]:
Then we can use str.replace to solve this problem. Be careful when applying str methods over a Series: it has to contain only str objects, otherwise NaN is returned.
In [ ]:
pd.to_numeric(s.str.replace("'","."))
In [53]:
df.columns
Out[53]:
If we recover the unstructured DataFrame, we can now convert it to the correct data types:
In [52]:
df["Name"] = df["Name"].str.title().str.strip()
df["Age"] = pd.to_numeric(df["Age"].str.replace("\"",""))
df
In [ ]:
#this does not work
#pd.to_timedelta(df["Time"].str.strip().map(lambda x: "00:"+x),errors="coerce")
#workaround
def str_convert(x):
    return pd.NaT if len(x) == 0 else "00:" + x
pd.to_timedelta(df["Time"].str.strip().map(str_convert),box=True)
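pd.to_datetime works analogously for dates; here is a minimal sketch with made-up strings:
In [ ]:
# "not a date" cannot be parsed and becomes NaT thanks to errors="coerce"
pd.to_datetime(pd.Series(["2016-01-31", "2016-02-29", "not a date"]),
               errors="coerce")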
Exercise: Read iqsize.csv using read_csv. Use na_values to assign missing data.
Exercise: Read country_info_worldbank.xls using read_excel. Use skiprows to get rid of empty lines.
Exercise: Try to load papers.lst reading it line by line taking into account that the fields are the following ones:
Exercise: Repeat the previous exercise taking advantage of the str functions.
Exercise: Load books.csv. Extract the author, or 'Unknown' when missing. Then work with the title: erase irrelevant data and try to get as many authors and dates as possible.
Exercise: Convert to Technically Correct Data: iqsize.csv.
Exercise: Convert to Technically Correct Data: country_info_worldbank.xls.